Titanic: tutorial and examples

imports

In [1]:
import dalex as dx # version 0.1.4

import pandas as pd
import numpy as np

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

load data

First, split the data into explanatory variables X and the target variable y.

In [2]:
data = dx.datasets.load_titanic()

X = data.drop(columns='survived')
y = data.survived
In [3]:
data.head(10)
Out[3]:
gender age class embarked fare sibsp parch survived
0 male 42.0 3rd Southampton 7.1100 0 0 0
1 male 13.0 3rd Southampton 20.0500 0 2 0
2 male 16.0 3rd Southampton 20.0500 1 1 0
3 female 39.0 3rd Southampton 20.0500 1 1 1
4 female 16.0 3rd Southampton 7.1300 0 0 1
5 male 25.0 3rd Southampton 7.1300 0 0 1
6 male 30.0 2nd Cherbourg 24.0000 1 0 0
7 female 28.0 2nd Cherbourg 24.0000 1 0 1
8 male 27.0 3rd Cherbourg 18.1509 0 0 1
9 male 20.0 3rd Southampton 7.1806 0 0 1

create a pipeline model

  • numerical_transformer pipeline:

    • numerical_features: choose numerical features to transform
    • impute missing data with median strategy
    • scale numerical features with standard scaler
  • categorical_transformer pipeline:

    • categorical_features: choose categorical features to transform
    • impute missing data with 'missing' string
    • encode categorical features with one-hot
  • aggregate those two pipelines into a preprocessor using ColumnTransformer

  • make a basic classifier using MLPClassifier with three hidden layers of sizes 150, 100 and 50
  • construct a clf pipeline model, which combines the preprocessor with the basic classifier model
In [4]:
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

classifier = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=500, random_state=0)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])

fit the model

In [5]:
clf.fit(X, y)
Out[5]:
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean...
                               batch_size='auto', beta_1=0.9, beta_2=0.999,
                               early_stopping=False, epsilon=1e-08,
                               hidden_layer_sizes=(150, 100, 50),
                               learning_rate='constant',
                               learning_rate_init=0.001, max_fun=15000,
                               max_iter=500, momentum=0.9, n_iter_no_change=10,
                               nesterovs_momentum=True, power_t=0.5,
                               random_state=0, shuffle=True, solver='adam',
                               tol=0.0001, validation_fraction=0.1,
                               verbose=False, warm_start=False))],
         verbose=False)

create an explainer for the model

In [6]:
exp = dx.Explainer(clf, X, y, label = "Titanic MLP Pipeline")
Preparation of a new explainer is initiated

  -> label             : Titanic MLP Pipeline
  -> data              : 2207 rows 7 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> predict function  : <function yhat_proba at 0x00000186E8418288> will be used
  -> predicted values  : min = 2.7205375671318212e-06, mean = 0.3367353380775521, max = 0.9999999997383016
  -> residual function : difference between y and yhat
  -> residuals         : min = -0.921242269825129, mean = -0.01457856417632876, max = 0.9751663054089054
  -> model_info        : package sklearn

A new explainer has been created!

dalex functions

[diagram: overview of the dalex Explainer methods]

The functions above are accessible from the Explainer object through its methods.

Each of them returns a new object that stores its output in a result field (a pandas.DataFrame) and provides a plot method.

predict

This function is just a regular model prediction, accessed through the Explainer interface.

Let's create two example persons for this tutorial.

In [7]:
john = pd.DataFrame({'gender': ['male'],
                     'age': [25],
                     'class': ['1st'],
                     'embarked': ['Southampton'],
                     'fare': [72],
                     'sibsp': [0],
                     'parch': [0]},
                     index = ['John'])
In [8]:
mary = pd.DataFrame({'gender': ['female'],
                     'age': [35],
                     'class': ['3st'],
                     'embarked': ['Cherbourg'],
                     'fare': [25],
                     'sibsp': [0],
                     'parch': [0]},
                     index = ['Mary'])

You can make a prediction on many samples at the same time.

In [9]:
exp.predict(X)[0:10]
Out[9]:
array([0.07907226, 0.20628711, 0.13463174, 0.60372994, 0.76485216,
       0.16150944, 0.03705073, 0.99324938, 0.19563509, 0.12184964])

It also works on a single instance; however, the only accepted input format is a pandas.DataFrame.

Prediction of survival for John.

In [10]:
exp.predict(john)
Out[10]:
array([0.08127727])

Prediction of survival for Mary.

In [11]:
exp.predict(mary)
Out[11]:
array([0.97830209])
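For scikit-learn classifiers, exp.predict corresponds to taking the positive-class column of predict_proba. A minimal sketch of that default wrapper, using a toy pipeline as a stand-in for the Titanic model (the data and model here are illustrative, not the tutorial's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy data standing in for the Titanic frame
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'age': rng.uniform(1, 80, 200),
                      'fare': rng.uniform(5, 100, 200)})
y_toy = (X_toy['fare'] > 50).astype(int)

model = Pipeline([('scaler', StandardScaler()),
                  ('clf', LogisticRegression(max_iter=1000))]).fit(X_toy, y_toy)

def yhat_proba(model, data):
    # what the Explainer's default predict function does for classifiers:
    # return the probability of the positive class
    return model.predict_proba(data)[:, 1]

passenger = pd.DataFrame({'age': [25], 'fare': [72]}, index=['John'])
print(yhat_proba(model, passenger))  # a single probability in [0, 1]
```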

predict_parts

  • 'break_down'

  • 'break_down_interactions'

  • 'shap'

This function calculates Variable Attributions as Break Down, iBreakDown or Shapley Values explanations.

The model prediction is decomposed into parts, each attributed to a particular variable.

In [12]:
bd_john = exp.predict_parts(john, type='break_down')
bd_interactions_john = exp.predict_parts(john, type='break_down_interactions')

sh_mary = exp.predict_parts(mary, type='shap', B = 10)
In [13]:
bd_john.result.label = "John"
bd_interactions_john.result.label = "John+"

bd_john.result
Out[13]:
variable_name variable_value variable cumulative contribution sign position label
0 intercept 1 intercept 0.336735 0.336735 1.0 8 John
1 class 1st class = 1st 0.583093 0.246358 1.0 7 John
2 age 25.0 age = 25.0 0.595401 0.012308 1.0 6 John
3 sibsp 0.0 sibsp = 0.0 0.585751 -0.009650 -1.0 5 John
4 fare 72.0 fare = 72.0 0.319029 -0.266722 -1.0 4 John
5 parch 0.0 parch = 0.0 0.300772 -0.018257 -1.0 3 John
6 embarked Southampton embarked = Southampton 0.284191 -0.016580 -1.0 2 John
7 gender male gender = male 0.081277 -0.202914 -1.0 1 John
8 prediction 0.081277 0.081277 1.0 0 John
In [14]:
bd_john.plot(bd_interactions_john)
In [15]:
sh_mary.result.label = "Mary"

sh_mary.result.loc[sh_mary.result.B == 0, ]
Out[15]:
variable contribution variable_name variable_value sign label B
4 age = 35.0 0.070339 age 35 1.0 Mary 0
2 class = 3st 0.176106 class 3st 1.0 Mary 0
5 embarked = Cherbourg 0.078406 embarked Cherbourg 1.0 Mary 0
1 fare = 25.0 0.100320 fare 25 1.0 Mary 0
0 gender = female 0.077791 gender female 1.0 Mary 0
3 parch = 0.0 0.126529 parch 0 1.0 Mary 0
6 sibsp = 0.0 0.012076 sibsp 0 1.0 Mary 0
In [16]:
sh_mary.plot(bar_width = 16)
In [17]:
sh_john = exp.predict_parts(john, type='shap', B = 10)
sh_john.result.label = "John"
sh_john.plot(max_vars=5)
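The break-down idea itself is simple: start from the average prediction, fix the instance's variables one by one, and record how the expected prediction shifts at each step. A rough sketch of that loop on a toy model (the variable order is fixed by hand here; dalex also handles ordering and interactions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy data and model, not the Titanic pipeline
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'age': rng.uniform(1, 80, 300),
                      'fare': rng.uniform(5, 100, 300)})
y_toy = (X_toy['fare'] + rng.normal(0, 10, 300) > 50).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

instance = pd.DataFrame({'age': [25], 'fare': [72]})

def break_down(model, X, instance, order):
    """Sequentially fix variables to the instance's values and
    track the change in the mean predicted probability."""
    X_fixed = X.copy()
    prev = model.predict_proba(X_fixed)[:, 1].mean()  # intercept
    contributions = {'intercept': prev}
    for var in order:
        X_fixed[var] = instance[var].iloc[0]
        curr = model.predict_proba(X_fixed)[:, 1].mean()
        contributions[var] = curr - prev
        prev = curr
    return contributions

print(break_down(model, X_toy, instance, order=['fare', 'age']))
```

By construction, the intercept plus all contributions sums to the model's prediction for the instance, just like the cumulative column in bd_john.result.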

predict_profile

  • 'ceteris_paribus'

This function computes individual profiles, also known as Ceteris Paribus profiles.

In [18]:
cp_mary = exp.predict_profile(mary)
cp_john = exp.predict_profile(john)

cp_mary.result.head()
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00, 45.00it/s]
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00, 51.99it/s]
Out[18]:
gender age class embarked fare sibsp parch _yhat_ _vname_ _ids_ _label_
Mary female 35.000000 3st Cherbourg 25.0 0 0 0.978302 gender Mary Titanic MLP Pipeline
Mary male 35.000000 3st Cherbourg 25.0 0 0 0.237394 gender Mary Titanic MLP Pipeline
Mary female 0.166667 3st Cherbourg 25.0 0 0 0.999749 age Mary Titanic MLP Pipeline
Mary female 2.000000 3st Cherbourg 25.0 0 0 0.999640 age Mary Titanic MLP Pipeline
Mary female 4.000000 3st Cherbourg 25.0 0 0 0.999461 age Mary Titanic MLP Pipeline
In [19]:
cp_mary.plot(cp_john)
In [20]:
cp_john.plot(variable_type = "categorical")
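A ceteris-paribus profile is conceptually just this: take one observation, vary a single variable over a grid while holding everything else fixed, and re-predict. A minimal sketch on a toy model (not the Titanic pipeline; dalex chooses the grid from the data automatically):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy stand-in data and model
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'age': rng.uniform(1, 80, 300),
                      'fare': rng.uniform(5, 100, 300)})
y_toy = (X_toy['fare'] > 50).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

def ceteris_paribus(model, instance, variable, grid):
    # replicate the observation, vary one variable, predict
    profile = pd.concat([instance] * len(grid), ignore_index=True)
    profile[variable] = grid
    profile['_yhat_'] = model.predict_proba(profile[instance.columns])[:, 1]
    return profile

mary_like = pd.DataFrame({'age': [35.0], 'fare': [25.0]})
cp = ceteris_paribus(model, mary_like, 'fare', np.linspace(5, 100, 20))
print(cp[['fare', '_yhat_']].head())
```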

model_performance

  • 'classification'

  • 'regression'

This function calculates various Model Performance measures:

  • classification: F1, accuracy, recall, precision and AUC
  • regression: mean squared error, R squared, median absolute deviation
In [21]:
mp = exp.model_performance(model_type = 'classification')
mp.result
Out[21]:
recall precision f1 accuracy auc
0 0.651195 0.813708 0.723437 0.839601 0.877274
In [22]:
mp.result.auc[0]
Out[22]:
0.8772742315184608
In [23]:
mp.plot()
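The same measures can be cross-checked directly with sklearn.metrics, thresholding the predicted probabilities at 0.5. A sketch on a toy model (dalex computes these from the Explainer's stored predictions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# toy stand-in data and model
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'age': rng.uniform(1, 80, 300),
                      'fare': rng.uniform(5, 100, 300)})
y_toy = (X_toy['fare'] + rng.normal(0, 15, 300) > 50).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

proba = model.predict_proba(X_toy)[:, 1]
pred = (proba > 0.5).astype(int)   # 0.5 cutoff, as in model_performance

scores = {'recall': recall_score(y_toy, pred),
          'precision': precision_score(y_toy, pred),
          'f1': f1_score(y_toy, pred),
          'accuracy': accuracy_score(y_toy, pred),
          'auc': roc_auc_score(y_toy, proba)}  # AUC uses raw probabilities
print(scores)
```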

model_parts

  • 'variable_importance'

  • 'ratio'

  • 'difference'

This function calculates Variable Importance.

In [24]:
vi = exp.model_parts()
vi.result
Out[24]:
variable dropout_loss label
0 _full_model_ 0.345622 Titanic MLP Pipeline
1 embarked 0.366973 Titanic MLP Pipeline
2 parch 0.367799 Titanic MLP Pipeline
3 fare 0.368883 Titanic MLP Pipeline
4 sibsp 0.375249 Titanic MLP Pipeline
5 age 0.396493 Titanic MLP Pipeline
6 class 0.433561 Titanic MLP Pipeline
7 gender 0.504440 Titanic MLP Pipeline
8 _baseline_ 0.558109 Titanic MLP Pipeline
In [25]:
vi.plot(max_vars=5)
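model_parts is permutation-based: shuffle one variable at a time and measure how much the loss grows relative to the full model. A bare-bones version using 1 − AUC as the loss, on toy data with one informative and one noise feature (dalex averages over several permutation rounds):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# toy data: 'signal' drives the target, 'noise' does not
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'signal': rng.uniform(0, 1, 500),
                      'noise': rng.uniform(0, 1, 500)})
y_toy = (X_toy['signal'] + rng.normal(0, 0.2, 500) > 0.5).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

def dropout_loss(model, X, y, variable=None, seed=0):
    """Loss (1 - AUC) after permuting one variable; None = full model."""
    X_perm = X.copy()
    if variable is not None:
        X_perm[variable] = X_perm[variable].sample(frac=1, random_state=seed).values
    return 1 - roc_auc_score(y, model.predict_proba(X_perm)[:, 1])

full = dropout_loss(model, X_toy, y_toy)
for var in X_toy.columns:
    print(var, dropout_loss(model, X_toy, y_toy, var) - full)
```

Permuting the informative variable destroys the ranking and inflates the loss, which is exactly what the dropout_loss column in vi.result captures.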

model_profile

  • 'partial'

  • 'accumulated'

This function calculates explanations that explore the model response as a function of selected variables.

The explanations can be calculated as Partial Dependence Profiles or Accumulated Local Dependence Profiles.

In [26]:
pdp_num = exp.model_profile(type = 'partial')
pdp_num.result["_label_"] = 'pdp'
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00,  9.52it/s]
In [27]:
ale_num = exp.model_profile(type = 'accumulated')
ale_num.result["_label_"] = 'ale'
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00,  9.64it/s]
Calculating accumulated dependency!: 100%|██████████████████████████████████| 4/4 [00:00<00:00,  4.23it/s]
In [28]:
pdp_num.plot(ale_num)
In [29]:
pdp_cat = exp.model_profile(type = 'partial', variable_type='categorical', variables = ["gender","class"])
pdp_cat.result['_label_'] = 'pdp'
ale_cat = exp.model_profile(type = 'accumulated', variable_type='categorical', variables = ["gender","class"])
ale_cat.result['_label_'] = 'ale'
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00,  8.56it/s]
Calculating ceteris paribus!: 100%|█████████████████████████████████████████| 7/7 [00:00<00:00,  8.74it/s]
Calculating accumulated dependency!: 100%|██████████████████████████████████| 2/2 [00:00<00:00,  3.60it/s]
In [30]:
ale_cat.plot(pdp_cat)
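A partial-dependence profile is just the average of the ceteris-paribus profiles: for each grid value, force the variable to that value for every observation and average the predictions. A compact sketch on a toy model (dalex additionally subsamples the data and handles categorical variables):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy stand-in data and model
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'age': rng.uniform(1, 80, 300),
                      'fare': rng.uniform(5, 100, 300)})
y_toy = (X_toy['fare'] > 50).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

def partial_dependence(model, X, variable, grid):
    means = []
    for value in grid:
        X_mod = X.copy()
        X_mod[variable] = value          # force the variable for everyone
        means.append(model.predict_proba(X_mod)[:, 1].mean())
    return pd.DataFrame({variable: grid, '_yhat_': means})

pdp = partial_dependence(model, X_toy, 'fare', np.linspace(5, 100, 25))
print(pdp.head())
```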
